Quantile Regression: An Exploration of the 2023 Behavioral Risk Factor Screening Survey Data

Authors: Hamilton Hewlett & Laura Sikes, MPH, CPH
Group: Not Sure
Instructor: Dr. Samantha Seals
Date: November 19, 2024

Introduction: Regression Overview L

  • Regression is a method used to predict outcomes based on the behavior of variables within a dataset relative to a mean.
  • Of an infinite number of lines that can be drawn through the data, the one best describing an overall relationship is the one with the smallest sum of the square of the errors.
  • The errors measure the distance from each data point to the line, then each value is squared, and those squares are summed. 
  • The distances are squared in order to eliminate the negative properties of the distances that are found between the line and the data points below it. 
  • The line of best fit is one that produces the smallest sum of these squares. 
  • The formula for the regression line is

\[ \hat{y} = \beta_0 + \beta_1 x \]

Introduction: Regression Overview -H

In the regression line equation \[ \hat{y} = \beta_0 + \beta_1 x \] where the intercept \(\beta_0\) is \[ \beta_0 = \frac{\sum y}{n} - \beta_1 \frac{\sum x}{n} \] and where the slope \(\beta_1\) is \[ \beta_1 = \frac{n \sum x y - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \]

Based on the above, predictions can be made about what would happen along the regression line where there are no data points.

Introduction: Regression Overview -H

  • Assumptions (Zambesi, 2024):

    • \(E\) expected value of error term is zero

    • All error terms have the same variance

    • Normality

    • Independent variables

      • The variables must not be dependent on each other when performing analysis

Introduction: Quantile Regression -L

  • Quantile regression (QR) is a statistical approach that divides a distribution into different sections and finds the median point inside each one. 
  • This approach is especially useful when a distribution contains exceedingly high or low values, given that medians are not as influenced by extreme values as compared to means. 
  • QR is often an excellent solution when only one upper or lower end of a distribution contains a key variable. 
  • Variables may enjoy unique relationships within different sections of a distribution (Buhai, 2005)
  • The equation for QR is

\[ Q_Y(\tau \mid X) = X \beta(\tau) \]

  • Q is the quantile function, τ represents the quantile of interest, Y is the response variable given the X predictor variable(s), β is the regression coefficient (Koenker, 2005).

Introduction: Quantile Regression -H

  • QR is a non-parametric method, meaning no assumptions are made about the sample size or the normality of the distribution (Cohen et al., 2016).
  • Koenker and Machado (1999) identified the absence of a suitable goodness-of-fit assessment for this statistical method.
  • The authors developed a method with inspiration from R2.  R2 is the coefficient of determination, which identifies the difference between the regression line and the mean. 
  • This coefficient’s values lie between 0 and 1, with values approaching 1 indicating ever-greater portions of the variation are explained by the relationship among the variables.

Introduction: Quantile Regression -H

  • Watch outs:

    • The more quantiles that are added, the smaller your sample size becomes. When you break your data into these sections you are actually breaking the sample population up and only looking at what fits in the given quantile. This can potentially cause validity issues if questioned about sample size (Lê Cook & Manning, 2013).

    • You also may run into crossing which shouldn’t happen but seems inevitable. It is a worry that would make you think this method should be deemed invalid. Think of 4 equations across a data set split up between the varying sections within the data. Knowing that unless a line is perfectly parallel with another, intersection is deemed to happen at some point. A non crossing model is attainable (Jiang & Yu, 2023).

Introduction: Quantile Regression, Goodness of Fit -L

The formula for R2 is as follows:

\[ R^2 = 1 - \frac{\sum \left(y_i - \hat{y}\right)^2}{\sum \left(y_i - \bar{y}\right)^2} \] - The sum of observed values \(y_i\)  is subtracted by the sum of fitted values \(\hat{y}\) and squared, forming the numerator. 
- The sum of observed values \(y_i\) minus the sum of sample mean \(\bar{y}\) and then squared, forming the denominator.
- This formula may also be represented as \[ R^2 = 1 - \frac{RSS}{TSS} \] where RSS is the residual sum of squares and TSS is the total sum of squares.
- The post-hoc testing methods Koenker and Machado proposed are Δn(τ), Wn(τ), and Tn(τ), which the authors claimed would vastly enhance the capacity for quantile regression inference. 
- The theory behind these methods is beyond the scope of this paper.

Introduction: Quantile Regression Applications -L

  • Applications for QR are found in a diverse array of fields, including economics, ecology, real estate, and healthcare.
  • Olsen et al. (2017) employed QR in a retrospective cohort study using data from insurer claims to identify factors affecting costs of infections following inpatient surgical procedures.
  • Serious infections were categorized as those occurring in hospital or needing follow-up surgery, and were found to be the most costly.
  • QR provided a method to focus on different factors associated with costs for the serious surgical infections located within that quantile. 

Introduction: Quantile Regression Applications -H

  • Waldmann (2018) utilized body mass index (BMI) data to illustrate the usefulness of QR.

  • The author’s goal was to answer questions about the outer portions of the distribution, which in this case was boys with the highest BMI.

  • QR afforded the author insight into how predictor variables can have stronger or weaker effects within different areas of the distribution. 

  • Cohen et al. (2016) used QR to evaluate the relationships between community demographics and perceived resilience. 

  • The authors divided the distribution into percentiles and quartiles, and found that significant relationships existed inside several quantiles that were not significant when using linear regression on the entire distribution. 

Introduction: Quantile Regression Applications -L

  • Staffa et al. (2019) employed QR to understand relationships between ventilator dependence and hospital length of stay, suspecting that these relationships changed across different quantiles. -The authors found stronger relationships between the predictor variables inside the upper quantiles of the length of stay outcome. 

  • Machado and Silva (2005) explored the use of QR for count data by incorporating artificial smoothness, allowing inferences to be made about the conditional quartiles.  The authors note that count data is often severely skewed, making it an excellent candidate for QR which is more resilient against skewness given its use of medians over means.

Introduction: Quantile Regression Applications -H

  • Le Cook and Manning (2013) showed that through quantile regression, raising taxes on alcohol did not reduce consumption rates of heavy alcohol users as it was intended. This study broke users into three groups, light, moderate, and heavy drinkers. Results using quantile regression showed that the light and heavy drinkers were not sensitive to the increase in price. Study showed that it only deterred moderate drinkers.

  • Le Cook and Manning (2013) also discuss how quantile regression is necessary when observing health care expenditures. They show that users can be more effectively analyzed when broken into sub groups depending on healthcare usage. This sheds more light on user’s status such as race, employment, gender, and insurance status among others. This is far more effective in analyzing data rather than regression based off the mean. This will not tell you the full story.

Methodology: Study Design and Data -L

  • Cross-sectional study using 2023 Behavioral Risk Factor Surveillance System (BRFSS) published by Centers for Disease Control and Prevention (CDC).
  • BRFSS is a yearly survey conducted via land-line and cellular telephone (CDC, 2024).
  • Includes U.S. residents aged 18 years and older.
  • Survey topics include demographic information, socioeconomic factors, physical and mental health status, chronic conditions, and behaviors affecting physical and mental health.
  • There is a minimum data collection requirement. Kentucky and Pennsylvania were omitted for not meeting that requirement.

Methodology: Research Question - Hewlett Project

  • I am looking to see if height and weight are significant predictors of the age participants are when informed they have diabetes.

Methodology: Predictor Variables - Hewlett Project

  • DIABETE4 is a variable used for participants to share if they have ever been told they have diabetes. The acceptable responses were 1 (Yes), 2 (Yes, but female told only during pregnancy), 3 (No), 4 (No, pre-diabetes or borderline diabetes), 7 (Don’t know/not sure), 9 (Refused). For the sake of this analysis, we stratified everything by 1 (Yes).

  • WEIGHT2 is a variable for participants to share how much they weigh without shoes on. The responses were acceptable 50-0776 (Weight in pounds), 7777 (Don’t know/Not sure), 9023 - 9352 (Weight in kilograms), 9999 (Refused), and BLANK (Not asked or Missing).

  • HEIGHT3 is a variable that allowed participants to share their height without shoes on. The acceptable responses recorded that I used are imperial. The metric responses were much larger and made sense to be metric.

Methodology: Response Variable - Hewlett Project

  • DIABAGE4 is a variable that allows participants to share how old they were when told they first had diabetes. The acceptable responses are 1-97 (Age in years), 98 (Don’t know/Not sure), 99 (Refused), and BLANK (Not asked or Missing)

  • To ensure smooth data acquisition. I ensured that this variable is only present in the stratified DIABETE4 (1) in the Table 1.

Methodology: Analytic Methods - Hewlett Project

Stratified by DIABETE4
1 2 3 4 7 9
n 1952 78 9764 222 28 2
DIABAGE4 (mean(SD)) 54.39 (18.62) - - - - -
WEIGHT2 (mean(SD)) 199.36 (50.47) 164.17 (36.29) 177.37 (45.32) 196.36 (51.12) 179.36 (55.79) 153.00 (62.23)
HEIGHT3 (mean(SD)) 610.68 (837.91) 1115.53 (2145.01) 709.05 (1238.72) 738.83 (1335.65) 508.54 (31.71) 506.00 (1.41)

Methodology: Analytic Methods - Hewlett

Figure 1.

Methodology: Analytical Methods - Hewlett

Figure 2.

Methodology: Analytic Methods - Hewlett

  • Started out considering only Age ~ Weight

  • Expanded into Age ~ Weight + Height

  • Created models using rq() from the quant reg package.

    • Found coefficients to create regression lines

    • Plotted results using ggplot2

  • Created confidence intervals from scratch due to vcov error when running rq() models and trying to use the confint() function.

Methodology: Response Variable - Sikes Project

  • Days per month a respondent’s poor health affected his or her activities of daily living (ADL)
  • Respondents asked to quantify the number of days per month having poor mental or physical health interfered with their ability to conduct ADLs.
  • Available responses were numeric values 1 – 30, none, refused, or don’t know/not sure.
  • The authors note that quantile regression is a non-parametric method, meaning no assumptions are made about the sample size or the normality of the distribution.

Methodology: Predictor Variables - Sikes Project

  • Weight: Available responses including numeric values 50 – 776 in pounds, don’t know/not sure, refused, and a separate indicator for weight measured in kilograms.
  • Education: Available responses included never attended school or only kindergarten, grades 1 through 8, grades 9 through 11, grade 12 or GED, 1 to 3 years of college, 4 years of college or more, and refused.
  • Income in U.S. dollars: Available responses included <10,000, $10,000 to < $15,000, $15,000 to < $20,000, $20,000 to < $25,000, $25,000 to < $35,000, $35,000 to < $50,000, $50,000 to < $75,000, $75,000 to < $100,000, $100,000 to < $150,000, $150,000 to < $200,000, $200,000 or more, don’t know/not sure, and refused.
  • Physical health: Number of days during the previous 30 days in which the respondent’s physical or mental health were not good, with available responses including numeric values 1 - 30, refused, or don’t know / not sure.
  • Sex: Available responses were limited to male or female.

Methodology: Operationalization of Variables - Sikes Project

  • Dataset filtered to include only residents of the state of Florida and removed of values containing “Refused,” “Not sure / Don’t know”, or missing.
  • Response variable ADL was filtered to include only numeric responses, and recoded to four groups as follows: “1-7 Days”, “8-14 Days,” “15-21 Days”, and “>21 Days”.
  • Education was recoded as follows: “Less than High School,”, “High School or GED,” “Some College”, and “4 Years of College or Higher.
  • Income was recoded into six groups as follows: “<$25,000”, “$25,000-$34,999”, “$35,000-$49,000”, “$50,000-$74,999”, “$75,000-$99,999”, and “>$100,000”.
  • Physical health and mental health were each recoded into four groups as follows: “1-7 Days”, “8-14 Days,” “15-21 Days”, and “>21 Days”.
  • All statistical analyses were performed using RStudio version 2024.04.2, build 764, “Chocolate Cosmos” release for Windows (Posit Software, PBC, 2024).

Methodology: Analytical Methods - Sikes Project

  • QR was performed to explore ADL as a function of weight.
  • The distribution was divided into quartiles (\(\tau\) = 0.25, 0.5, 0.75) to find the median values within each.
  • Ordinary least squares regression (OLS) was performed using the same variables.

Results: Hewlett Project

Table 1.

MM25 MM50 MM75
Height
Coefficient 0.00501 -0.00296 0.00466
Upper 95% Confidence Interval 0.028 0.018 0.03
Lower 95% Confidence Interval -0.018 -0.024 -0.021
P-Value 0.66932 0.78491 0.71881
Weight
Coefficient -0.02405 -0.04386 -0.0634
Upper 95% Confidence Interval -0.016 -0.03 -0.048
Lower 95% Confidence Interval -0.032 -0.057 -0.079
P-Value 0 0 0

Results: Hewlett Project

Diabetics age informed of diagnosis as a function of weight and height.

Results: Descriptive Statistics - Sikes Project

  • Majority (42.6%) of considered respondents reported between 1 to 7 days of poor mental or physical health negatively impacting ADL, followed by >21 days at 26.3% (Tables 2a - 2d).
  • Nearly one-third (31.9%) reported having greater than 21 days of poor physical health per month, while a similar portion (30.2%) reported having over 21 days of poor physical health.
  • Over one-half (50.2%) reported annual income of under $34,000, with just 13.1% reporting an income of over $100,000.
  • Weight ranged from 82 to 500 pounds with a median of 180 (SD = 53.58).
  • This sample is heavily skewed female with only 39.1% (n=493) identifying as male.

Results: Descripive Statistics - Sikes Project

Table 2a.

Category n = 1319 %
Self-Assessed General Health
   Excellent 42 3.3
   Very Good 205 16.3
   Good 383 30.4
   Fair 424 33.6
   Poor 265 21.0
Poor Physical Health Days / Mo.
   1-7 537 42.6
   8-14 168 13.3
   15-21 212 16.8
   >21 402 31.9

Results: Descripive Statistics - Sikes Project2

Table 2b.

Category n = 1319 %
Poor Mental Health Days / Mo.
   1-7 481 38.1
   8-14 210 16.7
   15-21 247 19.6
   >21 381 30.2
Poor Health Days Affected ADL / Mo.
   1-7 549 43.5
   8-14 191 15.1
   15-21 247 19.6
   >21 332 26.3

Results: Descripive Statistics - Sikes Project

Table 2c.

Category n = 1319 %
Education
   Less than High School 104 8.2
   High School or GED 345 27.4
   Some College 450 35.7
   4 Years College or More 420 33.3

Results: Descripive Statistics - Sikes Project

Table 2d.

Category n = 1319 %
Annual Income
   ≤ $24,999 413 32.8
   $25,000 - $34,999 220 17.4
   $35,000 - $49,999 183 14.5
   $50,000 - $74,999 192 15.2
   $75,000 - $99,999 146 11.6
   ≥$100,000 165 13.1
Sex
   Male 493 39.1
   Female 826 65.5

Results: Quantile Regression - Sikes Project

Table 3.

Quartile Poor ADL Days Coefficient 95% Lower CI 95% Upper CI p-Value
0.25 0.005 -0.002 0.012 0.158
0.50 0.000 -0.013 0.013 1.000
0.75 0.032 -0.004 0.068 0.078

Note. \(\alpha\) = 0.05

Results: Quantile Regression - Sikes Project

Figure 3.

Days per Month ADLs Affected by Poor Mental or Physical Health as a Function of Weight (lbs) Graph

Discussion: Hewlett Project

  • The predictor variable weight turned out to be a significant predictor. The p value was less than alpha. This was true in all quantiles.

  • The predictor variable height turned out not to be a significant predictor. The p value was greater than alpha. This was true in all quantiles.

  • It seems that weight is a significant predictor for determining the age you are informed you have diabetes based off my models, and the BRFSS data. Height however, seems to not be a significant predictor for the age you are informed you have diabetes.

  • We could also see in the MM25 model (0.25 quantile) the responses of refusal and don’t know were skewing that regression line.

Discussion: Sikes Project

Table 3.

Quartile Poor ADL Days Coefficient 95% Lower CI 95% Upper CI p-Value
0.25 0.005 -0.002 0.012 0.158
0.50 0.000 -0.013 0.013 1.000
0.75 0.032 -0.004 0.068 0.078
  • At the first quartile (\(\tau\) = 0.25), a coefficient of 0.005 (95% CI = -0.002, 0.012) indicates that for each one pound increase in weight, the number of days per month of ADL affected by poor mental or physical health increases by 0.005 days at the \(\alpha\) = 0.95 level (p=0.158) (Table 3). This finding is not statistically significant.
  • At the second quartile (\(\tau\) = 0.50), a coefficient of 0.000 (95% CI = -0.013, 0.013) indicates that for each one pound increase in weight, the number of days per month of ADL affected by poor mental or physical health is not detected at the \(\alpha\) = 0.95 level (p=1.000). This finding is not statistically significant.
  • At the third quartile (\(\tau\) = 0.75), a coefficient of 0.032 (95% CI = -0.004, 0.068) indicates that for each one pound increase in weight, the number of days per month of ADL affected by poor mental or physical health increases by 0.032 days at the \(\alpha\) = 0.05 level (p=0.078). This finding is not statistically significant.

Discussion: Sikes Project

  • Since the third quartile (\(\tau\) = 0.75) was marginally but not statistically significant with a p-value of 0.078 at the \(\alpha\) = 0.05 level, it may be useful to investigate these relationships further.
  • Weight (lbs) may have a larger effect at the higher extreme of the distribution not examined in this study.
  • Future studies could examine the effect at the ninetieth (\(\tau\) = 0.90) and ninety-fifth (\(\tau\) = 0.95) percentiles.

Conclusion

  • QR is well-represented within the literature and has numerous advantages over traditional regression models when relationships between variables are suspected to change across quantiles.
  • QR is an excellent analytical method to employ when stronger relationships are suspected at the extremes of the distribution.
  • With applications from real estate to life-and-death healthcare outcomes, QR affords a unique perspective into the nuances of our data.

References

Buhai, S. (2005). Quantile regression: Overview and selected applications. Ad Astra, 4(4), 1–17.
CDC. (2024). 2023 BRFSS survey data and documentation. https://www.cdc.gov/brfss/annual_data/annual_2023.html.
Cohen, O., Bolotin, A., Lahad, M., Goldberg, A., & Aharonson-Daniel, L. (2016). Increasing sensitivity of results by using quantile regression analysis for exploring community resilience. Ecol. Indic., 66, 497–502.
Jiang, R., & Yu, K. (2023). No-crossing single-index quantile regression curve estimation. J. Bus. Econ. Stat., 41(2), 309–320.
Koenker, R. (2005). Quantile regression. Cambridge University Press.
Koenker, R., & Machado, J. A. F. (1999). Goodness of fit and related inference processes for quantile regression. J. Am. Stat. Assoc., 94(448), 1296.
Lê Cook, B., & Manning, W. G. (2013). Thinking beyond the mean: A practical guide for using quantile regression methods for health services research. Shanghai Arch. Psychiatry, 25(1), 55–59.
Machado, J. A. F., & Silva, J. M. C. S. (2005). Quantiles for counts. J. Am. Stat. Assoc., 100(472), 1226–1237.
Olsen, M. A., Tian, F., Wallace, A. E., Nickel, K. B., Warren, D. K., Fraser, V. J., Selvam, N., & Hamilton, B. H. (2017). Use of quantile regression to determine the impact on total health care costs of surgical site infections following common ambulatory procedures. Ann. Surg., 265(2), 331–339.
Staffa, S. J., Kohane, D. S., & Zurakowski, D. (2019). Quantile regression and its applications: A primer for anesthesiologists. Anesth. Analg., 128(4), 820–830.
Waldmann, E. (2018). Quantile regression: A short story on how and why. Stat. Modelling, 18(3-4), 203–218.
Zambesi, A. (2024). Correlation and regression overview. Module 7, University of West Florida.